Skip to content

Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types#1347

Open
kosiew wants to merge 7 commits intoapache:mainfrom
kosiew:typeconversion-issue-1339
Open

Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types#1347
kosiew wants to merge 7 commits intoapache:mainfrom
kosiew:typeconversion-issue-1339

Conversation

@kosiew
Copy link
Contributor

@kosiew kosiew commented Jan 20, 2026

Which issue does this PR close?

Rationale for this change

Python UDAFs that logically return a list (e.g., “collect all timestamps for a group”) were easy to implement incorrectly by returning a pyarrow.Array (or ChunkedArray) directly from Accumulator.evaluate(). In that case, DataFusion’s Python ↔ Rust conversion path could attempt to treat the array object as an integer-like scalar, leading to confusing conversion failures such as:

  • ArrowTypeError: object of type <class 'pyarrow.lib.TimestampArray'> cannot be converted to int

To make list-returning UDAFs work reliably (and make the contract explicit), we now ensure the UDAF result is a list-valued scalar that matches the declared Arrow return type.

What changes are included in this PR?

  • Documentation & API contract clarification

    • Added a FAQ entry explaining how to return a list from a UDAF (return a list-valued pa.scalar and declare list types for return_type and state_type).
    • Expanded Accumulator.evaluate() docstring to explicitly state that returning a pyarrow.Array is not supported unless converted to a list-valued scalar.
  • Runtime behavior improvements in the Rust accumulator bridge

    • In RustAccumulator.evaluate, detect when the declared return type is a list (List, LargeList, FixedSizeList) and the Python evaluate() result is pyarrow.Array/pyarrow.ChunkedArray.
    • Convert such array-like values into a list-valued pyarrow.scalar(..., type=<declared list type>) before extracting the scalar value for DataFusion.
    • Cache the pyarrow.Array and pyarrow.ChunkedArray Python types inside the accumulator to avoid repeated pyarrow imports/type lookups on every evaluate call.
  • Tests

    • Added a regression test (test_udaf_list_timestamp_return) covering a UDAF that returns a list of timestamp(ns) values.
    • Added a small CollectTimestamps accumulator used by the test that maintains list state as a list-valued scalar and returns a list-valued scalar from evaluate().

Are these changes tested?

Yes.

  • Added a new regression test verifying that a UDAF can return list<timestamp(ns)> and that the collected results match the expected list-of-timestamps output.

Are there any user-facing changes?

Yes.

  • Docs: Added guidance in the UDF/UDAF user guide on returning lists from UDAFs.
  • Behavior: If a user returns a pyarrow.Array/ChunkedArray from evaluate() for a list-typed UDAF, it is now converted into the correct list-valued scalar automatically (rather than failing with a type conversion error).
  • No breaking API changes are intended; this clarifies and enforces the existing contract that evaluate() must return a scalar.

LLM-generated code disclosure

This PR includes code, comments generated with assistance from LLM. All LLM-generated content has been manually reviewed and tested.

Store UDAF return type in Rust accumulator and wrap
pyarrow Array/ChunkedArray returns into list scalars
for list-like return types. Add a UDAF test to return
a list of timestamps via a pyarrow array, validating
the aggregate output for correctness.
Add documented list-valued scalar returns for UDAF
accumulators, including an example with pa.scalar and a note
about unsupported pyarrow.Array returns from evaluate().
Also, introduce a UDAF FAQ entry detailing list-returning
patterns and required return_type/state_type definitions.
…nbinding and binding fresh copies when checking array-likeness, eliminating the Bound reference error
@kosiew kosiew marked this pull request as ready for review January 27, 2026 03:08
@timsaucer
Copy link
Member

timsaucer commented Feb 4, 2026

Sorry it's taken me a while to get around to this PR. It feels like we are doing two different things

  1. telling users that they need to return pyarrow scalars as the return of evaluate
  2. detect when it is a list and then we convert it to a python value and back into a pyarrow scalar

It feels like this isn't the best option. I think we want to avoid doing any kind of to_pylist() calls.

I think a more general solution would be something like

  1. Determine if they have passed in a pyarrow scalar value. If so, use it.
  2. If they have not passed in a pyarrow scalar value, use py_obj_to_scalar_value to convert to a scalar value
  3. Update py_obj_to_scalar_value to detect pyarrow arrays and convert them to ScalarValue::List

For the last part we could do something like

    if obj.hasattr("__arrow_c_array__")? {
        let array_data = ArrayData::from_pyarrow_bound(&obj)?;

        let array = make_array(array_data);

        // ScalarValue::ListArray must be a list of length 1
        let offsets = OffsetBuffer::new(ScalarBuffer::from(vec![0, array.len() as i32]));
        let list_array = Arc::new(ListArray::new(
            Arc::new(Field::new_list_field(array.data_type().clone(), true)),
            offsets,
            array,
            None,
        ));

        return Ok(ScalarValue::List(list_array));
    }

Additionally, if we're going to go down this route I think we would want to treat both the state() and evaluate() since both of them should be returning scalars.

An advantage of the point described above is that I think it adds more flexibility to the users because their python functions can just return python integers and such without having to convert them to pyarrow scalars.

What do you think?

@timsaucer
Copy link
Member

One problem I see with my answer above ^ is that some libraries like nanoarrow DO implement __arrow_c_array__ on a scalar value, and we wouldn't want to accidentally turn that into a ScalarValue::List.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cannot do udaf that returns list of timestamps

2 participants